(57) ABSTRACT

Data integrity and availability is assured by preventing a
node of a distributed, clustered system from accessing
shared data in the case of a failure of the node or
communication links with the node. The node is prevented from
accessing the shared data in the presence of such a failure by 
ensuring that such a failure is detected in less time than a 
secondary node would allow user I/O activities to com- 
mence after reconfiguration. The prompt detection of failure 
is assured by periodically determining which configuration 
of the current cluster each node believes itself to be a 
member of. Each node maintains a sequence number which
identifies the current configuration of the cluster. 
Periodically, each node exchanges its sequence number with 
all other nodes of the cluster. If a particular node detects that
it believes itself to be a member of a preceding configuration 
to that to which another node belongs, the node determines 
that the cluster has been reconfigured since the node last 
performed a reconfiguration. Therefore, the node must no 
longer be a member of the cluster. The node then refrains 
from accessing shared data. In addition, if a node suspects a 
failure in the cluster, the node broadcasts a reconfigure 
message to all other nodes of the cluster through a public 
network. Since the messages are sent through a public 
network, failure of the private communications links 
between the nodes does not prevent receipt of the reconfig- 
ure messages. 

9 Claims, 3 Drawing Sheets 
DATA INTEGRITY AND AVAILABILITY IN A
DISTRIBUTED COMPUTER SYSTEM

FIELD OF THE INVENTION

The present invention relates to fault tolerance in distributed
computer systems and, in particular, to a particularly robust
mechanism for assuring integrity and availability of data in the
event of one or more failures of nodes and/or resources of a
distributed system.

BACKGROUND OF THE INVENTION

1. Introduction and Background

Main motivations for distributed computer systems are the addition
of high availability and increased performance. Distributed computer
systems support a host of highly-available or parallel applications,
such as Oracle Parallel Server (OPS), Informix XPS, or HA-NFS. While
the key to the success of distributed computer systems in the
high-end market has been high-availability and scalable performance,
another key has been the implicit guarantee that the data trusted to
such a distributed computer system will remain integral.

2. System Model and Classes of Failures

The informal system model for some distributed computer systems is
that of a "trusted" asynchronous distributed system. The system is
composed of 2 to 4 nodes that communicate via message exchange on a
private communication network. Each node of the system has two paths
into the communication medium so that the failure of one path does
not isolate the node from other cluster members. The system has a
notion of membership that guarantees that the nodes come to a
consistent agreement on the set of member nodes at any given time.
The system is capable of tolerating failures, and the failed nodes
are guaranteed to be removed from the cluster within bounded time by
using fail-fast drivers and timeouts. The nodes of the distributed
system are also connected to an external and public network that
connects the client machines to the cluster. The storage subsystem
may contain shared data that may be accessed from different nodes of
the cluster simultaneously. The simultaneous access to the data is
governed by the upper layer applications. Specifically, if the
application is OPS, then simultaneous access is granted, as OPS
assumes that the underlying architecture of the system is a
shared-disk architecture. All other applications that are supported
on the cluster assume a shared-nothing architecture and, therefore,
do not require simultaneous access to the data.

While the members of a cluster are "trusted", nodes that are not
part of the current membership set are considered "un-trusted". The
un-trusted nodes should not be able to access shared resources of
the cluster. These shared resources are the storage subsystem and
the communication medium. While the access to the communication
medium may not pose a great threat to the operation of the cluster,
other than a possible flooding of the medium by the offending
node(s), the access to the shared storage sub-system constitutes a
serious danger to the integrity of the system, as an un-trusted node
may corrupt the shared data and compromise the underlying database.
To fence non-member nodes from access to the storage sub-system, the
nodes that are members of the cluster exclusively reserve the parts
of the storage sub-system that they "own". This results in exclusion
of all other nodes, regardless of their membership status, from
accessing the fenced parts of the storage sub-system. The fencing
has been done, up to now, via low level SCSI-2 reservation
techniques, but it is possible to fence the shared data by the
optional SCSI-3 persistent group reservations as those are, in fact,
a super-set of SCSI-2 reservations. It is important to note that we
assume that the nodes that could possibly form the cluster,
regardless of whether they are current cluster members or not, do
not behave maliciously. The cluster does not employ any mechanisms,
other than requiring root privileges for the user, to prevent
malicious adversaries from gaining membership in the cluster and
corrupting the shared database.

While it is easy to fence a shared storage device if that device is
dual-ported, via the SCSI-2 reservations, no such technique is
available for multi-ported devices, as the necessary but optional
SCSI-3 reservations are not implemented by the disk drive vendors.
In this paper we assume that the storage sub-system is entirely
composed of either dual-ported or multi-ported devices. Mixing the
dual and multi-ported storage devices does not add any complexity to
our algorithms, but will make the discussion more difficult to
follow. It should be pointed out, however, that a multi-ported
storage sub-system has better availability characteristics than a
dual-ported one, as more node failures can be tolerated in the
system without loss of access to the data. Before we investigate the
issues of availability, data integrity, and performance, we should
classify the nature of faults that our systems are expected to
tolerate.

There are many possible ways of classifying the types of failures
that may occur in a system. The following classification is based on
the intent of the faulty party. The intent defines the nature of the
faults and can lead the system designer to guard against the
consequences of such failures. Three classes of failures can be
defined on the basis of the intent:

1. No-Fault Failures: This class of failures includes various
hardware and software failures of the system such as node failures,
the communication medium failures, or the storage sub-system
failures. All these failures share the characteristic that they are
not the result of any misbehavior on the part of the user or the
operator. A highly-available system is expected to tolerate some
such failures. The degree to which a system is capable of tolerating
such failures and the effect on the users of the system determine
the class (e.g., fault-tolerant or highly-available) and level (how
many and what type of failures can be tolerated simultaneously or
consecutively) of availability in a traditional paradigm.

2. Inadvertent Failures: The typical failure in this class is that
of an operator mistake or pilot error. The user that causes a
failure in this class does not intend to damage the system; however,
he or she is relied upon to make the right decisions, and deviations
from those decisions can cause significant damage to the system and
its data. The amount of protection the system incorporates against
such failures defines the level of trust that exists between the
system and its users and operators. A typical example of this trust
in a UNIX environment is that of a user with root privileges that is
relied upon to behave responsibly and not delete or modify files
owned and used by other users. Some distributed systems assume the
same level of trust as the operating system and restrict all the
activities that can affect other nodes or users and their data to a
user with root privileges.

3. Malicious Failures: This is the most difficult class of failures
to guard against and is generally solved by use of authenticated
signatures or similar security techniques. Most systems that are
vulnerable to attacks by malicious users must take extra security
measures to prevent access to such users. Clusters, in general, and
Sun Cluster 2.0 available from Sun Microsystems, Inc. of Palo Alto,
Calif., in particular, are typically used as back-end systems and
are assumed immune from malicious attacks as they are not
directly visible to users outside the local area network in which
they are connected to a selected number of clients. As an example of
this lack of security, consider a user that can break into a node
and then joins that node as a member of a cluster of a distributed
system. The malicious user can now corrupt the database by writing
on the shared data and not following the appropriate protocols, such
as acquiring the necessary data locks. This lack of security
software is due to the fact that some distributed systems are
generally assumed to operate in a "trusted" environment.
Furthermore, such systems are used as high-performance data servers
that cannot tolerate the extra cost of running security software
required to defend the integrity of the system from attack by
malicious users.

Note that the above classification is not comprehensive, nor are the
classes distinct. However, such classification serves as a model for
discussing the possible failures in a distributed system. As
mentioned earlier, we must guard against no-fault failures and make
it difficult for inadvertent failures to occur. We do not plan to
incorporate any techniques in system software to reduce the
probability of malicious users gaining access to the system.
Instead, we can offer third party solutions that disallow access to
potentially malicious parties. One such solution is the FireWall-1
product by Check Point Software Technologies Limited, which controls
access and connections and provides for authentication. Note that
the addition of security to a cluster greatly increases the cost and
complexity of communication among nodes and significantly reduces
the performance of that system. Due to the performance requirements
of high-end systems, such systems typically incorporate security
checks in the software layer that interacts directly with the public
network and assume that the member nodes are trusted so that the
distributed protocols, such as that of membership, do not need to
embed security in their designs.

SUMMARY OF THE INVENTION

In accordance with the present invention, data integrity and
availability is assured by preventing a node of a distributed,
clustered system from accessing shared data in the case of a failure
of the node or communication links with the node. The node is
prevented from accessing the shared data in the presence of such a
failure by ensuring that such a failure is detected in less time
than a secondary node would allow user I/O activities to commence
after reconfiguration. The prompt detection of failure is assured by
periodically determining which configuration of the current cluster
each node believes itself to be a member of. Each node maintains a
sequence number which identifies the current configuration of the
cluster. Periodically, each node exchanges its sequence number with
all other nodes of the cluster. If a particular node detects that it
believes itself to be a member of a preceding configuration to that
to which another node belongs, the node determines that the cluster
has been reconfigured since the node last performed a
reconfiguration. Therefore, the node must no longer be a member of
the cluster. The node then refrains from accessing shared data. In
addition, if a node suspects a failure in the cluster, the node
broadcasts a reconfigure message to all other nodes of the cluster
through a public network. Since the messages are sent through a
public network, failure of the private communications links between
the nodes does not prevent receipt of the reconfigure messages.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a distributed computer system which
includes dual-ported devices.

FIG. 2 is a block diagram of a distributed computer system which
includes a multi-ported device.

FIG. 3 is a logic flow diagram of the determination that a node has
failed in accordance with the present invention.

FIG. 4 is a logic flow diagram of the broadcasting of a
reconfiguration message in accordance with the present invention.

DETAILED DESCRIPTION

3. Dual-Ported Architectures

A system with dual-ported storage sub-system is shown in FIG. 1. In
this figure, the storage device 111 is accessible from nodes 102A
and 102D, storage device 112 is accessible from nodes 102A and 102B,
storage device 113 from 102B and 102C, and storage device 114 from
102C and 102D. Note that in this configuration, if the two nodes
that are connected to a storage device fail, then the data on that
storage device is no longer accessible. While the use of mirrored
data can overcome this limitation in availability, the required
mirroring software has the inherent limitation that it reduces the
maximum performance of the system. Furthermore, note that not all
the data is local to every node. This implies that a software
solution is needed to allow remote access to the devices that are
not local. In one embodiment, Netdisk provides such a function. Such
a software solution requires the use of the communication medium and
is inherently inferior in performance to a solution based on a
multi-ported architecture. We will discuss the issues of performance
and data availability a bit more below. For the remainder of this
section we will concentrate on the issue of data integrity.

A system with dual-ported storage subsystem is shown in FIG. 1. In a
traditional cluster, data integrity in a system like FIG. 1 is
protected at three levels. First, we employ a robust membership
algorithm that guarantees a single primary that will continue to
function in the case the communication medium fails. Second, we use
fail-fast drivers to guarantee that nodes that are "too slow" will
not cause problems by not being able to follow the distributed
protocols at the appropriate rate. Third, and as a last resort, we
use the low level SCSI-2 reservations to fence nodes that are no
longer part of the cluster from accessing the shared database. As
the issue with multi-ported devices is that exclusive reservations
such as those of the SCSI-2 standard do not satisfy the requirements
of the system, let us concentrate on the issue of disk-fencing.

Disk-fencing is used as a last resort to disallow nodes that are not
members of the cluster from accessing the shared database. It
operates by an IOCTL reserve command that grants exclusive access to
the node requesting the reservation. The question to ask is what
classes of failures does this disk-fencing protect the system from?
Clearly, this low level fencing does not protect the system from
truly malicious users. Such adversaries can gain membership into the
cluster and corrupt the database while the system's guards are down.
Some less intelligent adversaries who cannot gain membership, e.g.,
because they may not be able to obtain root's password, may be
protected against if they happen to get into a node that is not a
member. However, if an adversary can gain access to a non-member
node of the cluster, could he or she not, and just as easily, gain
access to a member node of the cluster? The final analysis regarding
malicious adversaries, regardless of their level of sophistication,
is that we do not protect against them and [...].

The second category of failures defined above is inadvertent
failures. Does disk-fencing provide any protection
from inadvertent failures? Once again, the answer is "partially."
While nodes that are not members of the cluster cannot write to (or
read from) the shared database, those that are members can do so,
and the system relies on the users with the appropriate privileges
to not interfere with the database and corrupt it. For instance, if
the operator is logged into a node that is a cluster member and
starts to delete files that are used by the database, or worse,
starts to edit them, the system provides no protection against such
interventions. While the user's intent is not that of a malicious
adversary, the effect and mechanisms he uses are the same and,
therefore, we cannot effectively guard against such inadvertent
failures.

The third category of failures defined in the previous section is
no-fault failures. A system, whether a clustered system or a single
node system, should not allow such failures to result in data
corruption. The problem is a trivial one for a single node system.
If the system goes down, then there is no chance that the data will
be corrupted. While obvious, this has an implication for the
clustered systems; nodes that are down do not endanger the integrity
of the database. Therefore, the only issue that can cause a problem
with the data integrity in the system is that the nodes cannot
communicate with each other and synchronize their access to the
shared database. Let's look at an example of how such a failure can
lead to data corruption and investigate the root cause of such a
corruption.

Assume that the system of FIG. 1 is used to run an OPS application.
This means that the access to the shared database is controlled by
the Distributed Lock Manager (DLM). Now assume that node 102A is
unable to communicate with the rest of the cluster. According to
some membership algorithms, this failure will be detected and the
remaining nodes of the system, nodes 102B, 102C, and 102D, will form
a cluster that will not include node 102A. If node 102A does not
detect its own failure in a timely manner, then that node will
assume that it is still mastering the locks that were trusted to it.
As part of the cluster reconfiguration, other nodes will also think
that they are mastering the locks previously owned by node 102A. Now
both sides of this cluster, i.e., the side consisting of node 102A
and the side with the membership set {102B, 102C, 102D}, assume that
they can read from and write to the data blocks that they are
currently mastering. This may result in data corruption if both
sides decide to write to the same data block with different values.
This potential data integrity problem is the reason why disk-fencing
is used. Disk-fencing does cut the access of the nodes that are not
part of the current membership to the shared devices by reserving
such devices exclusively. In our example node 102B would have
reserved all the disks in storage device 112 and node 102D would
have done the same for all the disks in storage device 114. These
reservations would have effectively cut node 102A off from the
shared database and protected the integrity of the data.

It is interesting to note that while the above example deals with
OPS that allows simultaneous access to the shared devices, the same
arguments can be used for other applications that do not allow
simultaneous access to the shared data. This observation is based on
the fact that the underlying reason for data corruption is not the
nature of the application but the inability of the system to detect
the failure of the communication medium in a timely manner and,
therefore, allowing two primary components to operate on the same
database. For all applications that need to take over the data
accessible from a node that is no longer part of the membership set,
a similar reservation scheme is needed, and the current software
does provide such a mechanism.

While the use of disk-fencing is correct in principle, a closer look
at the software used to implement the implicit guarantee of data
integrity reveals some hidden "windows" that can violate such a
guarantee. Furthermore, under most common operating conditions the
low level reservations do not add any value to the system as far as
data integrity is concerned. To understand the nature and extent of
the "windows" that would violate the implicit guarantee of data
integrity, we must look at the actual implementation of
disk-fencing. In some conventional systems, user I/O is indeed
allowed prior to the commencement of disk-fencing. This clearly is a
flaw that has been overlooked in such systems and allows for the
possibility of multiple masters for [...] several minutes. This flaw
in data integrity assurance is specific to applications that use the
CVM Cluster Volume Manager. This flaw, however, also points to a
second phenomenon. Despite significant usage for a significant
period of time of such distributed systems, there have been no
indications, as measured by customer complaints or number of filed
bug reports, that data integrity has been compromised.

The lack of complaints regarding data integrity clearly demonstrates
that our three tier protection scheme to ensure data integrity is in
fact more than adequate and that the first two layers of protection
seem to suffice for all practical cases. A closer look at the timing
of events in the system, comparing the maximum duration of time
required to detect the failure of the communication medium, T_D,
with the minimum amount of time required to go through the steps of
reconfiguration before user I/O is allowed, T_R, shows that in all
practical cases, i.e., those cases where there is some amount of
data to protect, the membership algorithm and the fail-fast driver
implementation are adequate means of guaranteeing the integrity of
the database. In general, if we can guarantee that T_D < T_R, then
we can guarantee that the membership algorithm will ensure the data
integrity of the shared database. It should also be pointed out that
many commercial companies who are in the business of fault-tolerant
systems, e.g., Tandem Computers Incorporated or I.B.M., do not
provide the low-level protection of disk reservation for ensuring
that the data trusted to them remains integral. Therefore, what
really is the key to providing true data integrity guarantees to the
users of a cluster is not the low level reservations, but the timely
detection of failures and, in particular, the timely detection of
failures in the communication medium. Disk-fencing, although not
strictly necessary, is an extra security measure that eases the
customer's mind and is a strong selling point in commercial
products. Below, we will propose techniques that guarantee timely
detection of communication medium failures. While these techniques
are proposed for a multi-ported architecture, they are equally
applicable to, although perhaps not as necessary for, dual-ported
systems.

4. Proposed Multi-Ported Architectures

The system shown in FIG. 2 is a minimal multi-ported architecture.
In such a system all the storage devices are connected to all the
nodes of the cluster. For example, in FIG. 2, the storage device 210
is connected to all the nodes 202A, 202B, 202C, and 202D. In a
typical configuration there may be many more devices such as storage
device 210 that are connected to all the nodes. The storage devices
that allow such connectivity are Sonomas and Photons. Photons need a
hub to allow such direct connectivity, at least at the present time,
but the nature of the data integrity problem remains the same
whether there is a hub present in the storage sub-system or not. We
will discuss the issues of performance and data availability of such
an architecture in the next section.
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Before we discuss the proposed modifications to the section are based on the observation of the previous section; 

system software for ensuring the integrity of data, we should if we can ensure that the following inequality is satisfied, 

explore the conditions under which the SCSI-2 reservations then we are guaranteed that the data will remain integral for 

are not adequate. There are two sets of applications that can no-fault class of failures: 
exist on some distributed systems^ne-set is composed of-j 5 

EOPS and-allows simultaneous access to jhe same s hared^ata m fr i w /r \ 

bloclrb3rmorcJhan.a>inglein^^ "^i «/ 

sefallows only a single node to access a block of data at any In equation (1), is the time it takes to detect the failure 

given time. These two sets can be differentiated by the of the communication medium and start a reconfiguration 

volume managers that they use. OPS must use CVM while lo and T^j is the time it takes for the reconfiguration on a 

all the other applications can use VxVM. Note that some secondary node to allow user I/O activities to commence, 

applications, i.e., as HA-NFS or Intemet-Pro, can use either The current value of minimum value of Tj, is 5 seconds, but 

of the two volume managers as CVM is indeed a super-set that value is for a system without any application or database 

of VxVM. Let's first discuss the situation for VxVM based on it. A more typical value for T^ is in order of (anywhere 

applications and then move to the OPS. 15 from 1 to 10) minutes. This imphes that the maximum value 

Applications that use VxVM are based on the assumption of T^ must be less than 5 seconds if we intend to be strictly 

that the underlying architecture is a shared-nothing archi- correct. The solutions proposed in this section are comple- 

tecture. What this means is that the applications assume that mentary. In fact, the first solution, called disk-beat, guaran- 

there is only one "master'* that can access a data item at any tees that the inequality of equation (1) is met, and therefore, 

giventime.Therefore, when that master is no longer part of 20 is sufficient. The second solution is an added security 

the cluster the resources owned by it will be transferred to measure and does not guarantee data integrity by itself, but 

a secondary node that will become the new master. This it can help the detection process in most cases. Therefore, 

master can continue to reserve the shared resources in a the implementation of the second solution, which rehes on 

manner similar to the dual-ported architectures, however, reconfiguration messages over the pubUc-net, is optional, 

this imphes that there can only be one secondary for each 25^_In-th e-disk-bcat-solutionrthe-G lust6r-Mcmbership-Moqi- 
resource. This in fact limits the availabihty of the cluster as — -toF(GMN^-creates-an-V04hread-that-writeFCo^a-predefined 

multiple secondaries that can be used are not usedrtb location-on-the-shared-storage-device-its sequence-nimiber? 

overcome this fimitation we can use the forced reserve ,,-^-In^one2emho"dirnent~tfierseq^ 

lOCTL, but this means that the current quorum algorithihT ^ber noAes:throU£b:h eart^jeat messa ges (steps 302lLn(i'304 

which is also based on SCSI-2 reservations can no longer be 30 in FIG. 3). The sequence number represents, in one 

used. There are three solutions to the problem of quomm; embodiment, the number of times the cluster to which the 

first, we can set aside a disk (or a controller) to be used as subject node belongs has been reconfigured. In addition, the 

the quomm device and do not put any shared data on that same thread also will read the sequence number of the other 

device. Since the membership algorithm guarantees nodes in the cluster (step 306). If it finds that its sequence 

(through the use of the quorum device) that there will be 35 number is lagging behind that of any other cluster members, 

only one primary component in the system, we can be sure i.e., is less than the sequence number of any other node (test 

that nodes doing the forced reservation for the shared step 308), then it will execute a reconfiguration (step 310). 

devices are indeed the ones that should be mastering these The location of the "array" of sequence numbers and their 

devices in the new membership set. A second solution is to relative ordering is specified by the Quster Data Base 

bring down the cluster when the number of failures reaches 40 (CDB) file. The read and write activity is a periodic activity 

N-2, where N is the number of nodes in the cluster. This that takes place with period T, where T^Max{T^}. Note 

obviously reduces the availability of the system, but such a that as the thread will be a real-time thread with the highest 

cluster would not need to xise a quorum device as such priority and the processor that it will be executing on does 

devices are only used when N=2. Both of these solutions are not take any interrupts, therefore, the execution of the 

not optimal. For the first solution we need to have a disk (for 45 periodic read and write as well as the commencement of the 

those systems without a controller) reserved for the purpose reconfiguration is guaranteed to happen within the specified 

of quorum and for the second solution we are not utilizing
the full availability of the cluster. A third, albeit more
complex, solution avoids both of these pitfalls. This solution
was originally designed for a system with SDS volume
manager, but is equally applicable to a system where nodes
use forced reservation. In short, applications using VxVM
can be made ultra-available (i.e., as long as there is a path
to their data they will be able to operate) and preserve the
integrity of the data trusted to them in a multi-ported
architecture (such as the one shown in FIG. 2) using the
currently supported SCSI-2 reservations.

The situation for OPS is slightly more complex. OPS is
based on the assumption that each node that is running an
instance of OPS is indeed capable of accessing all the data
in the database, whether that data is local or not. This
assumption essentially eliminates the use of reservations that
are exclusive. While the best solution for ensuring data
integrity is to make sure that the optional SCSI-3 persistent
group reservations are implemented by the disk drive
vendors, there are alternate solutions that can satisfy the data
integrity requirements. The two solutions proposed in this

time period. Further note that the disk-beat solution is
merely a way of preventing two sides of a disjoint cluster
from staying up for any significant period of time. Finally,
note that this solution is asynchronous with respect to the
reconfiguration framework and is more timely than the one
used in current systems.

The second, and optional, solution relies on the public net
for delivery of a message and is based on a push technique.
This message is in fact a remote shell that will execute on all
nodes of the cluster. Once a node hits the return transition,
which may indicate that a failure has occurred, that node will
do a rsh clustm reconfig operation on all other nodes of the
cluster (step 402 in FIG. 4). Since reconfiguration is a cluster
wide activity and since most of the time the nodes would go
through the reconfiguration in a short period of time, the
addition of this call does not introduce a significant overhead
in terms of unnecessary reconfigurations. As mentioned
earlier, however, this solution does not guarantee the data
integrity as the disk-beat solution does and will only be used
in the system if it is deemed helpful for quicker detection of
failures.
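The push technique described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the described system: the function `broadcast_reconfigure`, the `Transport` callable, and the command string are hypothetical names, and the callable merely stands in for the remote shell delivered over the public net.

```python
from typing import Callable, Dict, List

# Hypothetical transport: delivers a command to one node over the public
# network and returns True on success. In the text this role is played by a
# remote shell; no real remote call is made in this sketch.
Transport = Callable[[str, str], bool]


def broadcast_reconfigure(self_node: str, members: List[str],
                          send: Transport,
                          command: str = "reconfig") -> Dict[str, bool]:
    """Ask every other cluster member to start a reconfiguration."""
    results: Dict[str, bool] = {}
    for node in members:
        if node == self_node:
            continue  # the sender is already going through reconfiguration
        try:
            results[node] = send(node, command)
        except OSError:
            # An unreachable node must not stop delivery to the others.
            results[node] = False
    return results
```

Because delivery failures are recorded rather than raised, the sketch reflects the point made in the text that this push is only an aid to quicker detection and is not what guarantees data integrity.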
In any system there is a small, but non-zero, probability
that the system is not making any forward progress and is
"hung". In a clustered system, a node that is hung can cause
the rest of the cluster to go through a reconfiguration and
elect a new membership that does not include the hung node.
While the disk-beat solution of the previous section guarantees
that a node will detect the failure of the communication
medium, it is based on the assumption that this node is
active. The fencing mechanisms, such as those based on the
SCSI-2 exclusive reservation schemes, can protect the data
integrity in the case of nodes that are inactive, or hung, for
a period of time and then come back to life and start issuing
I/O commands against the shared database. To raise the level
of protection against such failures for systems that cannot
use the exclusive reservations, we use a second protection
mechanism based on the fail-fast drivers. This scheme is
already in place, implemented as part of the original
membership monitor, and works by having a thread of the CMM
arming and re-arming a fail-fast driver periodically. If the
node cannot get back to re-arm the driver in its allotted time,
the node will panic. Let's say that the periodic re-arming
activity will happen with period T_ff. If we can guarantee the
inequality of equation (2) is satisfied then we can guarantee
that there will be no new I/O issued against the shared
database when the node "wakes up".

Max{T_ff} < Min{T_R}    (2)
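The arm and re-arm scheme can be illustrated with a toy model. This is only a sketch under assumptions: `FailFastTimer` and `satisfies_inequality` are hypothetical names, time is passed in explicitly rather than read from a clock, and the inequality is read as comparing the longest re-arming period against the shortest reconfiguration time.

```python
from typing import Iterable


class FailFastTimer:
    """Toy model of a fail-fast driver that must be re-armed before a deadline."""

    def __init__(self, timeout: float, now: float = 0.0) -> None:
        self.timeout = timeout
        self.deadline = now + timeout

    def rearm(self, now: float) -> None:
        # Called periodically by the membership monitor thread.
        self.deadline = now + self.timeout

    def expired(self, now: float) -> bool:
        # True means the node missed its allotted time and would panic.
        return now > self.deadline


def satisfies_inequality(rearm_periods: Iterable[float],
                         reconfig_times: Iterable[float]) -> bool:
    """Assumed reading of the inequality: max re-arm period < min reconfiguration time."""
    return max(rearm_periods) < min(reconfig_times)
```

For example, a timer created with a timeout of 2.0 and re-armed at time 1.0 has not expired at time 2.5 but has expired at time 3.5.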

It is important to note that the fail fast driver is scheduled
by the clock interrupt, the highest priority activity on the
entire system. However, the fail fast does not happen as part
of the interrupt and there is a window during which the I/O
already committed to the shared database may be written to
it. Since the fail fast driver is handled by the kernel and has
the priority of the kernel, this window will be small. In
addition, nodes that are hung are typically brought back to
life via a reboot. This, by definition, eliminates this node as
a potential source of data corruption. The only situation
which may allow our proposed system to compromise the
integrity of the data due to a no-fault failure is a hung node
that is brought back to life as a member of the cluster and has
user I/O queued up in front of the two highest priority tasks
in the system. Needless to say, this is an extremely unlikely
scenario.

5 Analysis and Conclusions

Above, we have looked at the issue of data integrity for 
two different architectures of clustered systems. The differ- 
ence in the cluster architecture is due to the differences in the 
storage sub-system. In one case the storage sub-system is 
entirely composed of dual-ported storage devices and in the
other case the storage sub-system is entirely composed of 
multi-ported devices. We classified and analyzed the various 
types of failures that a clustered system can see and tolerate 
and discussed the methods the system employs in the current 
dual-ported architectures to ensure data integrity in presence
of such failures. 

As pointed out above, for applications that utilize the
VxVM volume manager, the addition of a multi-ported
storage sub-system does not pose any danger to the data
integrity. In fact, for such systems the multi-ported storage
sub-system increases the degree of high-availability to
(N-1), where N is the number of nodes in the cluster. To
achieve this level of high-availability we found that
some modification to the quorum algorithm is necessary.

For the applications that use CVM as their volume
manager, i.e., OPS, the lack of low level SCSI-2 reservations
in multi-ported architectures can be overcome with the
introduction of disk-beat. This technique will protect the
data integrity against the same class of failures that disk
fencing does. We also showed an additional algorithm based on
the use of public net that can further improve the timing of
reconfigurations, in general, and detection of failed
communication medium, in particular. While a multi-ported
architecture does not suffer from a lack of guaranteed data
integrity in our system, it enjoys the inherent benefits of
additional availability and increased performance.
Performance is enhanced on several fronts. First, no software
solution is needed to create the mirage of a shared-disk
architecture. Second, with the use of RAID-5 technology the
mirroring can be done in hardware. Finally, such a system
can truly tolerate the failure of N-1 nodes, where N is the
number of nodes in the cluster, and continue to provide full
access to all the data.

The comparison of dual-ported and multi-ported
architectures done in this paper clearly indicates the inherent
advantages of the multi-ported systems over dual-ported
systems. However, dual-ported systems are able to tolerate 
some benign malicious failures and pilot errors that are not 
tolerated by multi-ported systems. While these are nice 
features that will be incorporated into the multi-ported 
architectures with the implementation of SCSI-3 persistent 
group reservations, they only affect OPS and are not general 
enough to guard the system against an arbitrary malicious 
fault or even an arbitrary inadvertent fault. 

The above description is illustrative only and is not
limiting. The present invention is therefore defined solely
and completely by the appended claims together with their
full scope of equivalents.

What is claimed is: 

1. A method for operating a clustered computer system 
including at least a first node and a second node, the method 
comprising: 

a first software controlled process executing in said first
node, wherein the first software controlled process is 
configured to cause a reconfiguration of a cluster con- 
figuration in response to detection of a failure; and 
a second software controlled process executing in said 
second node, wherein said second software controlled 
process is configured to detect said failure within a 
period of time which is less than the time for perform- 
ing said reconfiguration, and wherein said second soft- 
ware controlled process is executed in real time to 
guarantee that said second software controlled process 
detects said failure within the time for performing said 
reconfiguration; 
wherein said first node and said second node are config- 
ured to each separately detect said failure by: 
writing a first sequence number identifying a particular 
configuration of a cluster to which each node is a 
member to a shared storage device; 
reading a second sequence number from each other
node in said cluster;
comparing said first sequence number written to said 
shared storage device to said second sequence num- 
ber read from each other node in said cluster; and 
initiating said reconfiguration of said cluster configu- 
ration if said first sequence number written to said 
shared storage device is less than said second 
sequence number read from each other node in said 
cluster. 

2. The method as recited in claim 1, wherein said second 
software controlled process is executed periodically with a 
frequency which has a period less than a period of time for
performing said reconfiguration of said cluster. 

3. The method as recited in claim 1, wherein said method
further comprises sending a reconfiguration message to each 
other node in said cluster through a public communication 
medium. 

4. A computer readable medium useful in association with
a clustered computer system which includes a first node and
a second node, the computer readable medium including
computer instructions executable to implement a method 
comprising: 

a first software controlled process executing in said first 
node, wherein the first software controlled process is 
configured to cause a reconfiguration of a cluster con- 
figuration in response to detection of a failure; and 
a second software controlled process executing in said 
second node, wherein said second software controlled 
process is configured to detect said failure within a 
period of time which is less than the time for
performing said reconfiguration, and wherein said second
software controlled process is executed in real time to
guarantee that said second software controlled process 
detects said failure within the time for performing said 
reconfiguration; 
wherein said first node and said second node are config- 
ured to each separately detect said failure by: 
writing a first sequence number identifying a particular 
configuration of a cluster to which each node is a 
member to a shared storage device; 
reading a second sequence number from each other
node in said cluster;
comparing said first sequence number written to said 
shared storage device to said second sequence num- 
ber read from each other node in said cluster; and 
initiating said reconfiguration of said cluster configu- 
ration if said first sequence number written to said 
shared storage device is less than said second 
sequence number read from each other node in said 
cluster. 

5. The computer readable medium as recited in claim 4, 
wherein said second software controlled process is executed 
periodically with a frequency which has a period less than a 
period of time for performing said reconfiguration of said 
cluster. 

6. The computer readable medium as recited in claim 4, 
wherein said method further comprises sending a
reconfiguration message to each other node in said cluster through a
public communication medium.

7. A computer system comprising: 
a first node; 

a second node coupled to said first node, 
wherein said first node is configured to execute a first 
software controlled process which causes a reconfigu- 
ration of a cluster configuration in response to detection 
of a failure; and 
wherein said second node is configured to execute a 
second software controlled process which detects said 
failure within a period of time which is less than the 
time for performing said reconfiguration, and wherein 
said second software controlled process is executed in 
real time to guarantee that said second software con- 
trolled process detects said failure within the time for 
performing said reconfiguration; 
wherein said first node and said second node are further 
configured to each separately detect said failure by:
writing a first sequence number identifying a particular 
configuration of a cluster to which each node is a 
member to a shared storage device; 
reading a second sequence number from each other
node in said cluster;
comparing said first sequence number written to said 
shared storage device to said second sequence num- 
ber read from each other node in said cluster; and 
initiating said reconfiguration of said cluster configu- 
ration if said first sequence number written to said 
shared storage device is less than said second 
sequence number read from each other node in said
cluster. 

8. The computer system as recited in claim 7, wherein said 
second software control process is executed periodically 
with a frequency which has a period less than a period of 
time for performing said reconfiguration of said cluster. 

9. The computer system as recited in claim 7, wherein said 
step of initiating a reconfiguration of said cluster comprises 
sending a reconfiguration message to each other node in said 
cluster through a public communication medium. 
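
The detection steps recited in claims 1, 4 and 7 (write a sequence number, read the others, compare, initiate reconfiguration) can be sketched as below. This is an illustrative sketch only and does not define or limit the claims: `disk_beat_check` is a hypothetical name, a Python dictionary stands in for the shared storage device, and treating a larger number from any single other node as sufficient is an assumption.

```python
from typing import Dict


def disk_beat_check(node: str, my_seq: int, shared: Dict[str, int]) -> bool:
    """Return True if this node should initiate a reconfiguration."""
    # Write this node's sequence number to the (simulated) shared storage.
    shared[node] = my_seq
    # Read the sequence number recorded by every other node.
    others = [seq for name, seq in shared.items() if name != node]
    # A larger number elsewhere means the cluster was reconfigured without
    # this node, so its view of the configuration is stale.
    return any(my_seq < seq for seq in others)
```

With `shared = {"a": 3, "b": 4}`, the call for node `a` with sequence number 3 reports that a reconfiguration is needed, while the call for node `b` with sequence number 4 does not.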